Read the file

house <- read.csv("data_input/train.csv")
df_house <- read.csv("data_input/train.csv")
str(house)

Data description

1. SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
2. MSSubClass: The building class
3. MSZoning: The general zoning classification
4. LotFrontage: Linear feet of street connected to property
5. LotArea: Lot size in square feet
6. Street: Type of road access
7. Alley: Type of alley access
8. LotShape: General shape of property
9. LandContour: Flatness of the property
10. Utilities: Type of utilities available
11. LotConfig: Lot configuration
12. LandSlope: Slope of property
13. Neighborhood: Physical locations within Ames city limits
14. Condition1: Proximity to main road or railroad
15. Condition2: Proximity to main road or railroad (if a second is present)
16. BldgType: Type of dwelling
17. HouseStyle: Style of dwelling
18. OverallQual: Overall material and finish quality
19. OverallCond: Overall condition rating
20. YearBuilt: Original construction date
21. YearRemodAdd: Remodel date
22. RoofStyle: Type of roof
23. RoofMatl: Roof material
24. Exterior1st: Exterior covering on house
25. Exterior2nd: Exterior covering on house (if more than one material)
26. MasVnrType: Masonry veneer type
27. MasVnrArea: Masonry veneer area in square feet
28. ExterQual: Exterior material quality
29. ExterCond: Present condition of the material on the exterior
30. Foundation: Type of foundation
31. BsmtQual: Height of the basement
32. BsmtCond: General condition of the basement
33. BsmtExposure: Walkout or garden level basement walls
34. BsmtFinType1: Quality of basement finished area
35. BsmtFinSF1: Type 1 finished square feet
36. BsmtFinType2: Quality of second finished area (if present)
37. BsmtFinSF2: Type 2 finished square feet
38. BsmtUnfSF: Unfinished square feet of basement area
39. TotalBsmtSF: Total square feet of basement area
40. Heating: Type of heating
41. HeatingQC: Heating quality and condition
42. CentralAir: Central air conditioning
43. Electrical: Electrical system
44. 1stFlrSF: First Floor square feet
45. 2ndFlrSF: Second floor square feet
46. LowQualFinSF: Low quality finished square feet (all floors)
47. GrLivArea: Above grade (ground) living area square feet
48. BsmtFullBath: Basement full bathrooms
49. BsmtHalfBath: Basement half bathrooms
50. FullBath: Full bathrooms above grade
51. HalfBath: Half baths above grade
52. Bedroom: Number of bedrooms above basement level
53. Kitchen: Number of kitchens
54. KitchenQual: Kitchen quality
55. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
56. Functional: Home functionality rating
57. Fireplaces: Number of fireplaces
58. FireplaceQu: Fireplace quality
59. GarageType: Garage location
60. GarageYrBlt: Year garage was built
61. GarageFinish: Interior finish of the garage
62. GarageCars: Size of garage in car capacity
63. GarageArea: Size of garage in square feet
64. GarageQual: Garage quality
65. GarageCond: Garage condition
66. PavedDrive: Paved driveway
67. WoodDeckSF: Wood deck area in square feet
68. OpenPorchSF: Open porch area in square feet
69. EnclosedPorch: Enclosed porch area in square feet
70. 3SsnPorch: Three season porch area in square feet
71. ScreenPorch: Screen porch area in square feet
72. PoolArea: Pool area in square feet
73. PoolQC: Pool quality
74. Fence: Fence quality
75. MiscFeature: Miscellaneous feature not covered in other categories
76. MiscVal: $Value of miscellaneous feature
77. MoSold: Month Sold
78. YrSold: Year Sold
79. SaleType: Type of sale
80. SaleCondition: Condition of sale

Libraries that will be used

library(ggplot2)
library(hrbrthemes)
library(dplyr)
library(tidyr)
library(viridis)
library(lattice)
library(GGally)
library(fitdistrplus)
library(plotly)
library(scales)

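Initial analysis

# hist() is base graphics and cannot take ggplot2 scales via `+`, so the histogram is drawn with ggplot2
ggplot(data = house, mapping = aes(x = SalePrice)) +
  geom_histogram(bins = 50, fill = "magenta") +
  labs(x = "House Price") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

### Interpretations: SalePrice has appreciable positive skewness and deviates from the normal distribution. The higher the price, the fewer houses there are.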
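
The skewness read off the histogram can also be put into a number. The following is a minimal sketch using the moment-based sample skewness; `sample_skewness` is a hypothetical helper written for this illustration, and the `prices` vector is made-up data, not values from `train.csv`.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
# `sample_skewness` is a hypothetical helper, not part of any package used here.
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m3 / m2^1.5
}

# Made-up prices with one expensive outlier (illustration only)
prices <- c(100, 120, 130, 150, 400)
skew_prices <- sample_skewness(prices)   # positive: long right tail
skew_sym    <- sample_skewness(c(1, 2, 3))  # symmetric data gives 0

# On the real data this would be: sample_skewness(house$SalePrice)
```

A value clearly above zero indicates a long right tail, which is what the histogram suggests for SalePrice.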

Relationship between sale price and living area

First, make a category for the year built

house$age_category = with(house, ifelse(
  YearBuilt <= 1908, "1872-1908", ifelse(
    YearBuilt <= 1943, "1909-1943", ifelse(
      YearBuilt <= 1978, "1944-1978", ifelse(
        YearBuilt <= 2010, "1979-2010", ">2010"
  )))))
ggplot(data = house, 
       mapping = aes(x = GrLivArea,y = SalePrice, color = age_category)) +
  geom_point(size=2, shape = 16)+
  labs(title = "Sale Price & Ground Living Area",
       y = "sale price",
       x = "Ground Living Area",
       color = "Year Built")+ 
  scale_y_continuous(labels = comma)

Interpretations:

Sale price and above-ground living area have a roughly linear relationship: the bigger the area, the higher the price.

Relationship between sale price and basement area

ggplot(data = house, 
       mapping = aes(x = TotalBsmtSF,y = SalePrice, color = age_category)) +
  geom_point(size=2, shape = 16)+
  labs(title = "Sale Price & Total basement area",
       y = "sale price",
       x = "Total basement area",
       color = "Year Built")+ 
  scale_y_continuous(labels = comma)

### Interpretations: I see, so total basement area also has a strong linear relationship with sale price. Just like ground living area, the bigger the basement, the more expensive the house.

unique(house$YearBuilt)
##   [1] 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006 1960
##  [16] 1929 1970 1967 1958 1930 2002 1968 2007 1951 1957 1927 1920 1966 1959 1994
##  [31] 1954 1953 1955 1983 1975 1997 1934 1963 1981 1964 1999 1972 1921 1945 1982
##  [46] 1998 1956 1948 1910 1995 1991 2009 1950 1961 1977 1985 1979 1885 1919 1990
##  [61] 1969 1935 1988 1971 1952 1936 1923 1924 1984 1926 1940 1941 1987 1986 2008
##  [76] 1908 1892 1916 1932 1918 1912 1947 1925 1900 1980 1989 1992 1949 1880 1928
##  [91] 1978 1922 1996 2010 1946 1913 1937 1942 1938 1974 1893 1914 1906 1890 1898
## [106] 1904 1882 1875 1911 1917 1872 1905

Relationship between sale price and the quality of the house

ggplot(data = house,
       mapping = aes(x = OverallQual, y = SalePrice, group = OverallQual))+ 
  geom_boxplot(mapping = aes(fill = OverallQual))+
    scale_fill_gradient(low = "#f0bf00", high = "#58b258") +
   labs(title = "Sale Price & House Quality",
       y = "sale price",
       x = "House Quality",
       color = NULL) +
  theme(plot.title = element_text(face="bold"))+ 
  scale_y_continuous(labels = comma)

### Interpretations: house quality and sale price are good friends. Based on the plot, the better the quality, the more expensive the house gets. Just like stairs, the boxes keep going up.

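# geom_col() stacks the rows, so each bar is the total SalePrice of its category
ggplot(data = house,
       mapping = aes(x = age_category, y =SalePrice)) +
  geom_col(fill = "lightblue")+  # a constant fill goes outside aes()
  theme(legend.position="none")+ 
  scale_y_continuous(labels = comma)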

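### Interpretations: so maybe modern houses attract people more? Note that the bars show the total sale price per category, so they also grow with the number of houses sold, and we still don't know whether SalePrice is in constant prices.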
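
Because a bar of totals mixes "how many houses" with "how expensive each one is", it can help to look at the count and the median per category side by side. Here is a small sketch on made-up rows that stand in for `house`; the `toy` data frame and the `summary_by_age` name are assumptions for illustration only.

```r
# Made-up rows standing in for `house` (illustration only)
toy <- data.frame(
  age_category = c("1909-1943", "1909-1943", "1944-1978",
                   "1944-1978", "1944-1978", "1979-2010"),
  SalePrice    = c(100000, 120000, 130000, 150000, 140000, 250000)
)

# Count, total, and median sale price per age category
n_sold <- tapply(toy$SalePrice, toy$age_category, length)
total  <- tapply(toy$SalePrice, toy$age_category, sum)
med    <- tapply(toy$SalePrice, toy$age_category, median)

summary_by_age <- data.frame(
  age_category = names(n_sold),
  n            = as.vector(n_sold),
  total        = as.vector(total),
  median       = as.vector(med)
)
summary_by_age
```

In this toy example the middle category has the largest total only because it has the most rows, while the newest category has the highest median. The same summary could be run on `house` to check which effect drives the bar chart.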

data <- house[ , c("SalePrice", "OverallQual", "GrLivArea","GarageCars","TotalBsmtSF", "FullBath", "YearBuilt","age_category")]
ggcorr(data, label = TRUE)  # the non-numeric age_category column is ignored with a warning

Interpretations:

Almost all of the variables are correlated with SalePrice, especially OverallQual. But FullBath?? Really?
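
To read the same information as a ranked list rather than a heat map, the correlations with SalePrice can be pulled out of the correlation matrix directly. This is a rough sketch on a made-up data frame; the `toy` values are invented, and only the column names mirror the ones used in this notebook.

```r
# Made-up data with the same kind of columns as `data` (illustration only)
toy <- data.frame(
  SalePrice    = c(100, 150, 200, 250, 300),
  OverallQual  = c(4, 5, 6, 7, 9),
  FullBath     = c(1, 1, 2, 2, 2),
  age_category = c("a", "b", "b", "c", "c")
)

# Keep numeric columns only, then take the SalePrice column of the
# Pearson correlation matrix and sort it from strongest to weakest
numeric_cols <- toy[sapply(toy, is.numeric)]
cors <- cor(numeric_cols, use = "pairwise.complete.obs")[, "SalePrice"]
cors <- sort(cors[names(cors) != "SalePrice"], decreasing = TRUE)
cors
```

Replacing `toy` with the `data` frame built above would give the ranking for the real variables.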

plt <- ggpairs(data, columns = 1:7, ggplot2::aes(col = age_category),upper = list(continuous = wrap("cor", size = 3))) + 
  scale_y_continuous(labels = comma)+
  scale_x_continuous(labels = comma)
ggsave("plot.png", plot = plt, width = 15, height = 15, units = "in", dpi = 300)
knitr::include_graphics("plot.png")

#### No interpretation for this plot because it shows the same thing as before; it's just here to spoil your eyes :D

gg <- ggplot(data = house , aes(x=SalePrice)) +  
  geom_histogram(aes(y = after_stat(density), fill=age_category),bins = 30, alpha = 0.7)+ 
  geom_density(aes(color=age_category))+  
  geom_rug(aes(color=age_category))+ 
  labs(x = '',  
       y = '',  
       title = 'Distplot with Normal Distribution')+ 
  scale_y_continuous(labels = comma)+ 
  scale_x_continuous(labels = comma)    

ggplotly(gg)%>% 
  layout(plot_bgcolor='#e5ecf6',   
             xaxis = list(   
               title='Sale Price', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = '#ffff'),   
             yaxis = list(   
               title='Density', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = '#ffff')) 

Missing data

df_house[!complete.cases(df_house),]
sum(is.na(df_house))
## [1] 6965
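
The single total above does not say where the missing values sit. A per-column count is usually more useful; the sketch below shows the idea on a made-up data frame (the `toy` values are invented, only the column names come from this dataset).

```r
# Made-up rows standing in for `df_house` (illustration only)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Grvl", NA),
  SalePrice   = c(208500, 181500, 223500, 140000)
)

# Missing values per column, keeping only columns that have any,
# sorted from most to least affected
na_counts <- colSums(is.na(toy))
missing   <- sort(na_counts[na_counts > 0], decreasing = TRUE)
missing

# Share of rows missing in each affected column
missing / nrow(toy)
```

Running the same two lines on `df_house` would show which columns account for the 6965 missing cells. For columns such as Alley, a missing value may simply mean that the feature is absent rather than unknown, so it is worth checking the data description before imputing.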

Checking for normality

FIT <- fitdist(df_house$SalePrice, "norm")
plot(FIT)  # base graphics, so ggplot2 scales cannot be added with `+`

It looks like SalePrice is skewed. That's not a big problem, because a simple log transformation will do the magic.

df_house$SalePrice <- log(df_house$SalePrice)
FIT2 <- fitdist(df_house$SalePrice, "norm")    
plot(FIT2) 

Yup! SalePrice looks better than before.

FIT3 <- fitdist(df_house$GrLivArea, "norm")    
plot(FIT3) 

GrLivArea shows some skewness as well.

df_house$GrLivArea <- log(df_house$GrLivArea)
FIT4 <- fitdist(df_house$GrLivArea, "norm")
plot(FIT4)

Splendid!

FIT5 <- fitdist(df_house$TotalBsmtSF, "norm")
plot(FIT5)

Hmm, can we just do another log then? Let's look at the fit once more.

FIT6 <- fitdist(df_house$TotalBsmtSF, "norm")
plot(FIT6)

That's not looking great, and a plain log will not work here. We have:

Skewness, a significant number of observations with value zero (houses without a basement), and those zero values mean we can't apply a log transformation directly, because log(0) is -Inf.

To apply a log transformation here, one option is a binary variable that captures the effect of having or not having a basement, with the log applied only to the non-zero observations. This notebook takes the simpler route of a filter: it drops the houses without a basement so that the log transformation is defined for every remaining row.

# Keep only houses with a basement; filter() looks up TotalBsmtSF inside df_house
df_house <- 
  df_house %>%  filter(TotalBsmtSF > 0)
df_house$TotalBsmtSF <- log(df_house$TotalBsmtSF) 
FIT7 <- fitdist(df_house$TotalBsmtSF, "norm") 
plot(FIT7) 

# nice
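
Filtering as above removes the houses without a basement from `df_house`. The binary-variable idea mentioned earlier keeps every row instead. Here is a minimal sketch of that alternative on made-up values; the column names `HasBsmt` and `LogBsmtSF` are hypothetical and do not exist in the dataset.

```r
# Made-up basement areas, including houses without a basement (illustration only)
toy <- data.frame(TotalBsmtSF = c(0, 856, 1262, 0, 920))

# Binary indicator: 1 if the house has a basement, 0 otherwise
toy$HasBsmt <- as.integer(toy$TotalBsmtSF > 0)

# Log-transform only the non-zero areas; rows without a basement stay at 0
toy$LogBsmtSF <- 0
has_bsmt <- toy$TotalBsmtSF > 0
toy$LogBsmtSF[has_bsmt] <- log(toy$TotalBsmtSF[has_bsmt])

toy
```

With this layout a model can use `HasBsmt` and `LogBsmtSF` together, so no observations are lost. The trade-off is that a normality check on the area should then look at the non-zero rows only.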

df_house$age_category = with(df_house, ifelse(
  YearBuilt <= 1908, "1872-1908", ifelse(
    YearBuilt <= 1943, "1909-1943", ifelse(
      YearBuilt <= 1978, "1944-1978", ifelse(
        YearBuilt <= 2010, "1979-2010", ">2010"
  )))))
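
Since the same binning is written out for both `house` and `df_house`, it could be wrapped in one small function. The sketch below uses `cut()` in place of the nested `ifelse()` calls; `age_category_of` is a hypothetical helper name, and the fourth label is written as "1979-2010" on the assumption that 1978 belongs to the previous bin, as the `<=` comparisons imply.

```r
# Hypothetical helper: bin construction years into the same age categories.
# Intervals are right-closed, matching the `<=` comparisons in the ifelse() version.
age_category_of <- function(year_built) {
  as.character(cut(
    year_built,
    breaks = c(-Inf, 1908, 1943, 1978, 2010, Inf),
    labels = c("1872-1908", "1909-1943", "1944-1978", "1979-2010", ">2010"),
    right  = TRUE
  ))
}

# Boundary years land in the expected bins
example_years <- c(1872, 1908, 1909, 1978, 1979, 2010, 2011)
example_bins  <- age_category_of(example_years)
example_bins

# Usage on the notebook's data frames would be:
# house$age_category    <- age_category_of(house$YearBuilt)
# df_house$age_category <- age_category_of(df_house$YearBuilt)
```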

# Homoscedasticity

ggplot(data = df_house, 
       mapping = aes(x = GrLivArea,y = SalePrice, color = age_category)) +
  geom_point(size=2, shape = 16)+
  labs(title = "Sale Price & Ground Living Area after Norm",
       y = "log(sale price)",
       x = "log(Ground Living Area)",
       color = "Year Built")

ggplot(data = df_house, 
       mapping = aes(x = TotalBsmtSF,y = SalePrice, color = age_category)) +
  geom_point(size=2, shape = 16)+
  labs(title = "Sale Price & Total Basement Area after Norm",
       y = "log(sale price)",
       x = "log(Total Basement Area)",
       color = "Year Built")

### Interpretations: the scatter plots no longer have a conic (funnel) shape, so the log transformations appear to have solved the heteroscedasticity problem.
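
The "no more cone" reading can be backed by a rough numeric check: fit a straight line and compare the residual variance in the upper half of the predictor with the lower half. The sketch below does this on simulated data with multiplicative noise, which is an assumption chosen to mimic the funnel shape; `spread_ratio` is a hypothetical helper, not a formal test such as Breusch-Pagan.

```r
set.seed(42)

# Simulated data (illustration only): noise grows with area on the raw scale
n     <- 500
area  <- runif(n, min = 500, max = 4000)
price <- 100 * area * exp(rnorm(n, sd = 0.3))

# Hypothetical helper: residual variance above the median of x
# divided by residual variance below it (about 1 means constant spread)
spread_ratio <- function(x, y) {
  res   <- residuals(lm(y ~ x))
  upper <- x > median(x)
  var(res[upper]) / var(res[!upper])
}

raw_ratio <- spread_ratio(area, price)            # well above 1: funnel shape
log_ratio <- spread_ratio(log(area), log(price))  # close to 1 after the logs
c(raw = raw_ratio, log = log_ratio)
```

On the notebook's data the call would be `spread_ratio(df_house$GrLivArea, df_house$SalePrice)`, keeping in mind that both columns are already log-transformed at this point.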